Working with files

Most software needs to interact with the world one way or another.

The most important Unix abstraction a program deals with is the file. In the Unix philosophy almost everything is a file, so being able to work with files gives you control over many things.

File I/O

To read and write files there is a built-in function called open. It takes a path/filename and a mode.

In Python, a file can be opened in the following modes

  • reading, "r"
  • writing, "w" (truncates the file, destroying any existing contents)
  • appending, "a" (appends to the end of the file)

By default, a file is opened in text mode, that is to say the contents of the file are interpreted as text. If you want to handle the file as binary data, you can open it in binary mode with "rb", "wb" or "ab", depending on whether you want to read, write or append.

The readline() method reads a file until a newline "\n" character is reached and returns that string. A file object is also iterable, so you can loop over its lines with a for statement.

Always remember to close any files you've opened.


In [ ]:
filename = "../data/example_file.txt"
fp = open(filename, "w")
for string in ["Hello", "Hey", "moi"]:
    fp.write(string + "\n")
fp.close()

fp = open(filename, "r")
for line in fp:
    print(line.strip()) # the \n is contained in the line, calling strip removes whitespace at the end and beginning
fp.close()
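
The readline() method mentioned above is not used in the cell, so here is a minimal sketch of it together with the append mode "a", reusing the same filename (the appended string is arbitrary example data):

In [ ]:
# append one more line to the existing file
fp = open(filename, "a")
fp.write("terve\n")
fp.close()

# readline() returns one line at a time, including the trailing "\n";
# an empty string means the end of the file has been reached
fp = open(filename, "r")
print(repr(fp.readline()))
print(repr(fp.readline()))
fp.close()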

Context managers

Python also has a nifty construct called a context manager, used via the with statement, for dealing with, among other things, file-like objects. When used with open(), it takes care of closing the file no matter what happens inside the block, even if an exception is raised.

It makes for somewhat cleaner syntax as a whole.


In [ ]:
with open(filename, "r") as file:
    for line in file:
        print(line.strip())

File I/O modules in the stdlib

Compressed files

The Python standard library contains modules for dealing with the gzip (.gz), bzip2 (.bz2) and the less commonly used LZMA (.xz) compression algorithms. Additionally there is support for opening ZIP and tar archives, which may contain multiple files. Details can be found in the documentation.

There are more tools in the Python Package Index for many other formats.

The beauty of handling compressed files is that the abstraction level is essentially the same as working with an uncompressed file.


In [ ]:
import gzip 
# the library gzip offers an API like open(), see https://docs.python.org/3/library/gzip.html

zipped_file_name = "../data/zipped_file.gz"
with gzip.open(zipped_file_name, "wt") as zipped_file:
    for line in ["This", "is", "an", "example", "."]:
        zipped_file.write((line + "\n"))

In [ ]:
## Go ahead, try to read the lines from zipped_file_name and print them.
## It works just like in the examples above, except with gzip.open instead of open.
## As this is text, you'll need to open the file in mode "rt" and not just "r".

Implementation details like the "rt" vs. "r" mode vary a bit from module to module; check the documentation when unsure.
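
One possible solution to the exercise above, as a sketch (reusing zipped_file_name from the previous cell):

In [ ]:
import gzip  # already imported above, repeated here so the cell stands on its own

with gzip.open(zipped_file_name, "rt") as zipped_file:  # "rt": read as text
    for line in zipped_file:
        print(line.strip())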

Data interchange formats

Data can be stored in myriad ways.

A very common one is the so-called Comma-Separated Values (CSV) format

header1,header2
1,0
0,1
1,0

Another common one is JSON

{"key": "value", "key2": "value2"}

XML is, of course, an alternative

<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
</data>

XML examples would require such in-depth knowledge of XML that they are not covered in this notebook. Suffice it to say that it is possible to handle XML files.

Many software packages read their configurations from files in the INI format

[default]
value = 5

[special_configs]
bigger_value = 6

These four are mentioned as examples because there are modules for each of them in the Python Standard Library: csv, json, xml and configparser.
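
Since the INI format does not get its own section below, here is a minimal configparser sketch parsing the snippet above; read_string() is used so that no file on disk is needed:

In [ ]:
import configparser

config = configparser.ConfigParser()
config.read_string("""
[default]
value = 5

[special_configs]
bigger_value = 6
""")

# values are read back as strings; getint() converts for us
print(config["default"]["value"])
print(config.getint("special_configs", "bigger_value"))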

Also of special interest is Python's own pickle module, intended expressly for serializing Python objects.


In [ ]:
# we will use this object throughout the examples to illustrate different file format handling
pythons = [
    {"name": "Graham Chapham", "birthyear": 1941, "dead": True},
    {"name": "Eric Idle", "birthyear": 1943, "dead": False},
    {"name": "Terry Gilliam", "birthyear": 1940, "dead": False},
    {"name": "Terry Jones", "birthyear": 1942, "dead": False},
    {"name": "John Cleese", "birthyear": 1939, "dead": False},
    {"name": "Michael Palin", "birthyear": 1939, "dead": False},
]

CSV

The Comma-separated values format seems deceptively simple at first and a casual reader can be tempted into trying to create a parser themselves.

"What could possibly go wrong? It's a really simple format after all." 
- every starting developer at least once in their career

The number of different conventions makes parsing all kinds of CSV files highly nontrivial and it is therefore good that there is a separate library for that purpose.
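
A tiny illustration of the kind of input that trips up a hand-rolled parser: a quoted field containing the delimiter itself.

In [ ]:
import csv

line = 'Palin,"writer, actor"'
print(line.split(","))           # naive splitting breaks the quoted field apart
print(next(csv.reader([line])))  # the csv module handles the quoting correctly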

There are two simple ways to use the built-in csv library:

  • with headers
  • without headers

Without headers

  • to read, call csv.reader() with an open file
    • iterate over the returned object
  • to write, call csv.writer() with an open file
    • call the writerow() function of the returned object (a sketch of this header-less variant follows after these lists)

With headers

  • to read, call csv.DictReader() with an open file
    • iterate over the returned object; the returned values are dicts
  • to write, call csv.DictWriter() with an open file and the fieldnames as a list
    • call the writeheader() function to write the header row
    • call the writerow() function of the returned object with a dict
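
First the header-less variant, as a minimal sketch (the file name here is made up but follows the same ../data/ convention as the rest of the notebook); the DictWriter/DictReader variant with headers is shown in the cells after it.

In [ ]:
import csv

plain_filename = "../data/example_plain.csv"

# writing rows without a header; each row is just a list of values
with open(plain_filename, "w", newline="") as file_:
    writer = csv.writer(file_)
    for performer in pythons:
        writer.writerow([performer["name"], performer["birthyear"], performer["dead"]])

# reading them back; each row comes out as a list of strings
with open(plain_filename, "r", newline="") as file_:
    reader = csv.reader(file_)
    for row in reader:
        print(row)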

In [ ]:
import csv
filename = "../data/example.csv"

# newline="" is recommended by the csv module documentation to avoid
# spurious blank rows on some platforms
with open(filename, "w", newline="") as file_:
    writer = csv.DictWriter(file_, fieldnames=["name", "birthyear", "dead"])
    writer.writeheader()
    for performer in pythons:
        writer.writerow(performer)

In [ ]:
def print_performer_dict(performer):
    import datetime
    this_year = datetime.datetime.now().year
    if performer["dead"].lower() == "true":
        print("%s is dead" % performer["name"])
    else:
        print("%s turns %d this year" % (performer["name"], 
                                         this_year - int(performer["birthyear"])))

        
with open(filename, "r") as file_:
    reader = csv.DictReader(file_)
    for performer in reader:
        print_performer_dict(performer)

Note how the truth value and the number needed a bit of tinkering. This is one of the downsides of the CSV format: there is no agreed-upon way to mark what is a string, what is a number and what is a boolean value.

JSON

JSON is a data interchange format of the web age. Like CSV it has several flaws, yet it is widely used.

In Python, dicts usually map to JSON objects and lists to JSON lists, with some minor caveats: JSON has no set type (a set must be serialized as a list, losing the distinction), and JSON object keys must be strings, whereas Python permits any immutable object as a dictionary key. Also, there is no simple agreed-upon way to encode dates and times in JSON (ISO 8601 for human-readable dates and possibly Unix timestamps for machine-readable dates are recommended).

Also, the default json library may not be optimal in many respects, such as speed. There are alternatives on the Python Package Index, for example simplejson and ujson.

The different libraries handle corner cases differently, so it's usually not a good idea to use JSON as a persistence format between multiple Python programs.

However, the need does not arise very often when dealing with Internet-based systems.

The dump and load functions operate directly on files and take a file-like object as a parameter. The dumps and loads functions return and read a string, which is what the s stands for.
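
A quick demonstration of the string-returning variant, using the pythons list defined earlier:

In [ ]:
import json

as_text = json.dumps(pythons[0])  # dumps returns a str instead of writing to a file
print(as_text)
print(type(as_text))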


In [ ]:
import json

# we have two strategies, store the entire object as JSON or store each row as a separate JSON object,
# both exist in the wild world so both will be shown
# fortunately our dicts only contain very simple datums so there will be no issue
ex_1_file = "../data/example_json_1.json"
ex_2_file = "../data/example_json_2.json"

with open(ex_1_file, "w") as file_:
    json.dump(pythons, file_)

with open(ex_2_file, "w") as file_:
    for performer in pythons:
        json.dump(performer, file_)
        file_.write("\n")

In [ ]:
#reading back

def print_performer_dict_2(performer):
    import datetime
    this_year = datetime.datetime.now().year
    if performer["dead"]:
        print("%s is dead" % performer["name"])
    else:
        print("%s turns %d this year" % (performer["name"], 
                                         this_year -performer["birthyear"]))

with open(ex_1_file, "r") as file_:
    data = json.load(file_)
    for performer in data:
        print_performer_dict_2(performer)
print("####")
with open(ex_2_file, "r") as file_:
    for line in file_:
        performer = json.loads(line)
        print_performer_dict_2(performer)

Pickle

The simplest way to store Python objects is pickle. It is the standard way to serialize and deserialize Python objects.

Pickle serializes Python objects into byte strings that can be unpickled by other Python processes and threads. It can pickle almost any data you can express in Python. The tricky part is ensuring that both Python processes use compatible pickle protocol versions (an older Python may not be able to read pickles written with a newer protocol) and that they have the same versions of all relevant libraries.

Another caveat is that other programming languages don't support pickle; it's Python-only.


In [ ]:
import pickle

pickled_pythons = pickle.dumps(pythons) #pickle also has dump and dumps like json
#we could write pickled_pythons to a file here if we wanted to, but that's not really the point of the exercise
unpickled_pythons = pickle.loads(pickled_pythons)

print(str(pythons) == str(unpickled_pythons))

The beautiful thing about pickle is that it will serialize complex objects and deserialize them the same way.

For example, in machine learning one can train a classifier or regressor on a powerful computer for a long time until the algorithm converges, pickle the resulting object and distribute it to other machines.
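
For completeness, a minimal sketch of pickling to an actual file; pickle data is binary, so the file has to be opened in "wb"/"rb" mode (the file name is made up for this example):

In [ ]:
import pickle

pickle_filename = "../data/pythons.pickle"

with open(pickle_filename, "wb") as file_:
    pickle.dump(pythons, file_)

with open(pickle_filename, "rb") as file_:
    pythons_from_disk = pickle.load(file_)

print(pythons_from_disk == pythons)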

The io module

Many Python libraries operate on files or strings. Some library writers assume that everyone will always want to pass a file to their library. Others assume that the results should always be written to a file on the filesystem even when that is not strictly necessary.

For that purpose the io library in Python offers tools to create objects that look like files, even when they aren't.

There are two classes

  • StringIO for impersonating a file opened in text mode
  • BytesIO for impersonating a file opened in binary mode (a sketch follows after the StringIO example below)

In [ ]:
import io

my_output = io.StringIO()
writer = csv.DictWriter(my_output, fieldnames=["name", "birthyear", "dead"])
writer.writeheader()
writer.writerows(pythons)
file_contents = my_output.getvalue()
print("file contents would have been:\n")
print(file_contents)

print("---")
#let's construct another StringIO and use csv to read from a string and not a file
my_input = io.StringIO(file_contents)
reader = csv.DictReader(my_input)
for line in reader:
    print_performer_dict(line)
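
BytesIO works the same way for binary data. A minimal sketch, combining it with the gzip module from earlier so the compressed data never touches the disk:

In [ ]:
import gzip
import io

buffer = io.BytesIO()
# GzipFile can write into any file-like object via the fileobj argument
with gzip.GzipFile(fileobj=buffer, mode="wb") as gz:
    gz.write(b"Hello, compressed world\n")

compressed = buffer.getvalue()
print(len(compressed), "bytes of gzip data, never written to disk")

# reading the compressed bytes back from memory
with gzip.GzipFile(fileobj=io.BytesIO(compressed), mode="rb") as gz:
    print(gz.read())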

In [ ]: